chore: rolling promotion dev -> main#589
Conversation
|
Important Review skippedToo many files! This PR contains 295 files, which is 145 over the limit of 150. To get a review, narrow the scope: Upgrade to a paid plan to raise the limit. ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: ⛔ Files ignored due to path filters (5)
📒 Files selected for processing (295)
You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request significantly expands the Omni platform's capabilities by introducing a Twilio WhatsApp channel plugin, a host-fingerprint trust system for Genie hosts, and an output-redactor middleware for data privacy. It also hardens the system with idempotency guards for NATS subscribers, transient error retries for agent dispatches, and improved date validation across API endpoints. Review feedback identified critical bugs including a missing import in the media processor and potential data loss in chats.settings and sync job progress updates due to non-merging object replacements. Furthermore, the polling mechanism for linking media to events was noted as an area for performance optimization.
| const [event] = await ctx.db | ||
| .select({ id: omniEvents.id }) | ||
| .from(omniEvents) | ||
| .where(eq(omniEvents.id, eventId)) |
| await services.chats.update(data.chatId, { | ||
| settings: { agentPaused: true }, | ||
| }); |
There was a problem hiding this comment.
Accidental data loss in chats.settings. This update replaces the entire settings JSONB object with { agentPaused: true }, clobbering any existing metadata keys such as followUpConfig or closeOutcome. Per general rules, metadata updates should merge with existing data rather than replacing the entire object.
| await services.chats.update(data.chatId, { | |
| settings: { agentPaused: true }, | |
| }); | |
| const chat = await services.chats.getById(data.chatId); | |
| await services.chats.update(data.chatId, { | |
| settings: { ...((chat?.settings as Record<string, unknown>) ?? {}), agentPaused: true }, | |
| }); |
References
- When implementing an update command for a resource that has a metadata field, the update operation should merge the new metadata with the existing metadata rather than replacing it, to prevent accidental data loss of other metadata keys.
| await services.chats.update(data.chatId, { | ||
| settings: { | ||
| ...(shouldPauseAgent ? { agentPaused: true } : {}), | ||
| closed: terminal, | ||
| closeUntil: closeUntil?.toISOString() ?? null, | ||
| closeOutcome: outcome, | ||
| } as Record<string, unknown>, | ||
| }); |
There was a problem hiding this comment.
Accidental data loss in chats.settings. Similar to the handoff path, this update clobbers the entire settings JSONB column. Existing configuration keys like followUpConfig will be lost. You must fetch the current settings and merge them before updating.
| await services.chats.update(data.chatId, { | |
| settings: { | |
| ...(shouldPauseAgent ? { agentPaused: true } : {}), | |
| closed: terminal, | |
| closeUntil: closeUntil?.toISOString() ?? null, | |
| closeOutcome: outcome, | |
| } as Record<string, unknown>, | |
| }); | |
| const chat = await services.chats.getById(data.chatId); | |
| await services.chats.update(data.chatId, { | |
| settings: { | |
| ...((chat?.settings as Record<string, unknown>) ?? {}), | |
| ...(shouldPauseAgent ? { agentPaused: true } : {}), | |
| closed: terminal, | |
| closeUntil: closeUntil?.toISOString() ?? null, | |
| closeOutcome: outcome, | |
| } as Record<string, unknown>, | |
| }); |
References
- When implementing an update command for a resource that has a metadata field, the update operation should merge the new metadata with the existing metadata rather than replacing it, to prevent accidental data loss of other metadata keys.
| async function resolveSafeMediaContentEventId( | ||
| ctx: MediaProcessorContext, | ||
| eventId: string | undefined, | ||
| ): Promise<string | null> { | ||
| if (!isUuid(eventId)) return null; | ||
|
|
||
| // media_content is audit/replay metadata, so do not block media.processed for long. | ||
| // Event persistence runs concurrently with this processor and should normally win within milliseconds. | ||
| const maxWaitMs = 250; | ||
| const pollMs = 50; | ||
| const deadline = Date.now() + maxWaitMs; | ||
|
|
||
| while (true) { | ||
| try { | ||
| const [event] = await ctx.db | ||
| .select({ id: omniEvents.id }) | ||
| .from(omniEvents) | ||
| .where(eq(omniEvents.id, eventId)) | ||
| .limit(1); | ||
|
|
||
| if (event) return event.id; | ||
| } catch (error) { | ||
| log.debug('Failed to validate media_content event FK', { eventId, error: String(error) }); | ||
| return null; | ||
| } | ||
|
|
||
| if (Date.now() >= deadline) { | ||
| log.debug('Skipping media_content event FK; omni_event not found', { eventId }); | ||
| return null; | ||
| } | ||
|
|
||
| await new Promise((resolve) => setTimeout(resolve, pollMs)); | ||
| } | ||
| } |
There was a problem hiding this comment.
The polling loop in resolveSafeMediaContentEventId is an inefficient way to handle the race condition between event persistence and media processing. While 250ms is a short window, this pattern results in multiple unnecessary database queries and holds up the event handler. Consider using a database-level constraint or a more reactive approach (e.g., a background worker or a retry mechanism in the event bus) to link media content to events.
| const update: Partial<SyncJobProgress> = { | ||
| fetched: payload.fetched ?? 0, | ||
| stored: 0, | ||
| duplicates: 0, | ||
| mediaDownloaded: 0, | ||
| totalEstimated: | ||
| payload.progress && payload.progress > 0 | ||
| ? Math.round((payload.fetched ?? 0) / (payload.progress / 100)) | ||
| : 0, | ||
| }); | ||
| }; |
There was a problem hiding this comment.
Potential counter reset bug. If payload.fetched is undefined, it defaults to 0, which will clobber the existing fetched count in the database during the updateProgress merge. The update object should only include the fetched key if the value is explicitly provided in the payload to avoid accidental data loss.
const update: Partial<SyncJobProgress> = {};
if (payload.fetched !== undefined) {
update.fetched = payload.fetched;
}References
- When implementing an update command for a resource that has a metadata field, the update operation should merge the new metadata with the existing metadata rather than replacing it, to prevent accidental data loss of other metadata keys.
|
@codex review this pr |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3e765a0b4d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let instance: string | null = instanceId; | ||
| let chat: string | null = chatId; | ||
| let recipient: string | null = null; | ||
|
|
||
| // Path-param extraction: /instances/:id, /chats/:id | ||
| const pathInstance = PATH_INSTANCE_PREFIXES.map((p) => firstPathSegment(cleanPath, p)).find((v) => v != null) ?? null; | ||
| const pathChat = PATH_CHAT_PREFIXES.map((p) => firstPathSegment(cleanPath, p)).find((v) => v != null) ?? null; | ||
| if (!instance && pathInstance) instance = pathInstance; | ||
| if (!chat && pathChat) chat = pathChat; |
There was a problem hiding this comment.
Prioritize path instance/chat targets over JSON body
The target extraction currently trusts instanceId/chatId from the request body before path params, and only falls back to path values when the body omits them. Because requireSignedInstanceMiddleware and allowlist checks both consume extractLockTargets, a caller can hit a path-scoped endpoint like PATCH /api/v2/instances/:id and inject a different instanceId in JSON to make authorization decisions against the wrong instance/chat. This can bypass require_genie_signature and allowlist enforcement for path-targeted writes.
Useful? React with 👍 / 👎.
| agentId: instance.agentId ?? null, | ||
| outcome, | ||
| reason: data.reason ?? null, | ||
| escalated: terminal, |
There was a problem hiding this comment.
Publish actual escalation flag in chat.closed payload
The emitted chat.closed event sets escalated to terminal instead of the computed escalated value, so hard-terminal outcomes (won/lost) are always reported as escalations even when no threshold-based auto-promotion happened. This corrupts downstream audit/BI semantics and any automation that distinguishes explicit hard closes from repeated soft-close escalation.
Useful? React with 👍 / 👎.
bun publish started returning `404 Not Found` for scoped packages on 2026-05-07 even though npm whoami via the same NPM_TOKEN succeeds and the token has read+write access to @automagik/omni. The last successful publish via `bun publish` was 2.260506.1 on 2026-05-06 15:06; every publish since fails with the same error pattern: Tag: next Access: public Registry: https://registry.npmjs.org/ 404 Not Found: https://registry.npmjs.org/@automagik%2fomni - '@automagik/omni@<v>' does not exist in this registry Workflow code, bun version (1.3.13), and registry URL all unchanged across the success→failure boundary, so this is most likely a bun publish auth-header regression for scoped packages when given env-var auth only (no .npmrc written). Switch to the canonical npm setup: 1. Write ~/.npmrc with the registry's _authToken 2. Use `npm publish` (rock-solid, well-instrumented error messages) Net behavior preserved: same package, same scope, same tag, same access mode. Only the publisher binary + auth wiring changed.
…ublish ci(version): use npm publish + explicit .npmrc to fix scoped 404
Drop NPM_TOKEN dependency entirely. Mirror the proven pattern from
automagik/genie .github/workflows/version.yml — short-lived OIDC tokens
exchanged at workflow runtime, no long-lived secret to rotate, no 2FA-
bypass gymnastics.
Why now:
The classic NPM_TOKEN path silently 404'd starting 2026-05-06 evening.
Diagnosis showed token authenticated fine for whoami + reads; PUT (publish)
returned 404 — npm's stand-in for "this token can't bypass 2FA on publish
for this scope". Granular access tokens don't bypass 2FA by default.
Fix: stop using tokens. npm "Trusted Publishers" trusts the workflow
identity directly via OIDC. Felipe saved the trusted-publisher record
on npmjs.com pointing at automagik-dev/omni workflow version.yml (no env)
in parallel with this PR.
Three workflow changes, all mirroring genie's version.yml verbatim where
applicable:
1. Top-level permissions: + id-token: write
(required to mint OIDC tokens at runtime)
2. New step before publish: Upgrade npm to >= 11.5.1
Node 22 bundles npm 10.x which sends a placeholder OIDC token and
gets back a misleading 404 — fixed in npm 11.5.1+. Install latest.
3. Publish step rewrite:
- Removed: NPM_TOKEN env var, .npmrc write, skip-guard, NPM_CONFIG_TOKEN
- Added: NPM_CONFIG_PROVENANCE=false (Blacksmith runners fail sigstore
attestation with 422 even though OIDC token exchange itself works)
- Kept: HUSKY=0, the cd packages/cli && npm publish invocation
Comment block above the publish step calls out the 3 traps a future
maintainer would otherwise rediscover the hard way:
- npm 10 placeholder-token 404
- Blacksmith provenance 422
- workflow filename binding to the npmjs.com trusted-publisher record
NPM_TOKEN repo secret left in place during transition (defense in depth).
Cleanup is a follow-up wish after ≥3 successful OIDC publishes.
Wish: .genie/wishes/npm-trusted-publisher-oidc/WISH.md (bundled in this PR)
…-oidc ci(version): switch npm publish to OIDC trusted publishing
The "Upgrade npm for OIDC trusted publishing" step (npm install -g npm@latest) writes to /usr/local. omni's Blacksmith pool image rejects this with EACCES on /usr/local/share/man/man7, blocking the publish step that follows. GitHub-hosted ubuntu-latest gives the runner user write access to /usr/local, so npm -g installs cleanly. Trade-off: slower job startup (~30s vs. Blacksmith's ~5s) for publish reliability — acceptable for a once-per-PR workflow. Note: genie's identical workflow runs successfully on Blacksmith. The EACCES is specific to omni's Blacksmith image — likely a /usr/local ownership/perm difference between the two pools' base images. Whole-job switch (vs. splitting into bump-on-Blacksmith + publish-on- GitHub) keeps the workflow simple — no artifact passing between jobs.
…hosted ci(version): use ubuntu-latest for Auto Version (Blacksmith /usr/local EACCES blocks OIDC)
The previous attempt (#615 — switch to ubuntu-latest) didn't fix it. Both Blacksmith and ubuntu-latest's bundled Node lives at /usr/local with permissions the runner user can't write to. `npm install -g npm@latest` therefore fails with EACCES on /usr/local/share/man/man7 on either runner. setup-node@v5 installs Node 22 into the runner-user-writable tool_cache. Subsequent `npm -g` writes go to a user-owned path. No EACCES, no runner gymnastics. Keeping ubuntu-latest from #615 — Felipe's call that OIDC is unstable on Blacksmith. setup-node is the actual EACCES fix on top.
ci(version): add actions/setup-node to fix EACCES on npm -g install
…ND on promise-retry
ci(version): add --force to npm self-upgrade
…grade ci(version): use Node 24 to skip broken npm self-upgrade
`buildPm2StartArgs` did not pass `--cwd`, so pm2 inherited the calling shell's cwd and baked it into the omni-api / omni-nats entries. When that directory was later removed (transient build dir, deleted submodule, switched git branch), every subsequent `pm2 restart` failed silently with `Error: spawn bash ENOENT` because pm2 chdir'd into the missing path before exec. The 2026-05-07 incident: `omni update --next` ran from a shell whose cwd was `repos/pgserve` (a now-removed canonical-pgserve experiment directory). `startServices` did `pm2 delete omni-api && pm2 start ...` without `--cwd`, baking that path. Subsequent restarts (from the SessionStart hook, from `omni doctor --fix`, from pm2 max_restarts backoff) all failed with ENOENT and zero log output, because the launcher bash never even started. doctor's `cli-key-valid` rotation path then dutifully reported "rotated key does not validate" because the API it was validating against had no listener at all. The omni-server launcher (`bin/omni-server`) does its own `cd "$(dirname "$0")/../dist/server"` at line 6, so pm2's cwd does not need to point anywhere meaningful — it just needs to point somewhere that exists for the lifetime of the user account. `$HOME` is the obvious anchor. Adds: - `getPm2AnchorCwd()` in `pm2.ts` returning `homedir()`. - `--cwd <anchor>` injected by `buildPm2StartArgs` for both api and nats launches, so `startServices` (install/update), `fixPgserveCanonical`, `fixPm2EnvDrift`, and `fixPm2MaxRestarts` all benefit from a single change. - Regression tests asserting `--cwd` is present and equals `$HOME` for both api and nats argv.
fix(cli): anchor pm2 --cwd to $HOME so restarts cannot ENOENT
Regression introduced by 01b5511 (`ci(version): switch npm publish to OIDC trusted publishing`, 2026-05-07): only `version.yml` was migrated; `release.yml` continued using `NPM_TOKEN`. Once npm Trusted Publishing was registered for `@automagik/omni` (so version.yml could publish via OIDC), the package's legacy-token publish path stopped accepting `NPM_TOKEN` and started returning a misleading 404 — exactly the trap version.yml's own comment warns about: > Without [id-token: write], the publish step gets an empty > placeholder token and the registry returns a misleading 404 > instead of the real auth failure. Symptom from the failing run: Tag: latest Access: public Registry: https://registry.npmjs.org/ 404 Not Found: https://registry.npmjs.org/@automagik%2fomni '@automagik/omni@2.260508.1' does not exist in this registry Error: Process completed with exit code 1 The 404 fired during the `npm dist-tag add` retag fallback, after `bun publish` already failed because the package's only authorized publisher is now the OIDC Trusted Publisher entry pointing at version.yml — release.yml was effectively unauthenticated. Fix mirrors the version.yml pattern verbatim: - Add `id-token: write` permission so the runner can mint OIDC tokens. - Switch `runs-on` from `blacksmith-4vcpu-ubuntu-2404` to `ubuntu-latest`. Setup-node's tool_cache resolves under /usr/local on the Blacksmith image where the runner user gets EACCES; GitHub- hosted runners give the runner user write access there. (Same trade-off documented on version.yml.) - Add `actions/setup-node@v5` with `node-version: '24'` and `registry-url: 'https://registry.npmjs.org'`. Node 24 ships npm 11.x; Trusted Publishing requires npm >= 11.5.1. - Replace `bun publish` with `npm publish`. Drop the manual `~/.npmrc` write and the NPM_TOKEN env vars — OIDC handles auth for both `npm publish` AND the `npm dist-tag add` retag fallback. - Set `NPM_CONFIG_PROVENANCE: "false"` (npm auto-enables provenance whenever id-token is writable; sigstore attestation 422s on GitHub-hosted runners for this package). IMPORTANT: this file is now a publisher of @automagik/omni from npm's perspective. Both version.yml AND release.yml must be registered as Trusted Publishers at https://www.npmjs.com/package/@automagik/omni/access. The included comment block on the auth step calls this out so a future rename does not silently 404 again.
ci(release): migrate npm publish to OIDC Trusted Publishing
…ied-entry-flow-payload fix(channel-gupshup): accept simplified HV-Entry-Flow payload
Follow-up to #707. - Synthesized dedupe id now uses a content digest (FNV-1a) instead of text length, so two distinct messages from the same sender in the same second no longer collide and get one silently dropped as a duplicate. - normalizeSimplifiedWebhook returns null when there's no text, so an empty simplified payload drops instead of dispatching an empty inbound. - Tests: distinct same-second equal-length ids stay distinct; no-text payloads normalize to null.
…upe-collision fix(channel-gupshup): harden simplified-payload id + drop empty text
Addresses three P1 findings from the #589 bot review: extractLockTargets resolved authorization targets from path params and the JSON body only, with the body taking precedence. This allowed: • Scope bypass — a caller could hit a path-scoped write (e.g. PATCH /instances/:realId) and inject body.instanceId pointing at an allowlisted instance, so allowlist + require_genie_signature checks evaluated against the wrong instance. • Missed enforcement / false 403 — header-scoped routes (POST /turns/close targets via x-omni-instance / x-omni-chat) were invisible to the extractor, so signature enforcement was skipped and allowlisted profile keys were wrongly denied. Fix: • extractLockTargets now also reads x-omni-instance / x-omni-chat (via the new readHeaderTargets helper) and applies precedence path > header > body. Route-derived targets (path, header) are trusted; the caller-controllable body can no longer override them and is used only as a last resort. • require-signed-instance + scope-enforcer middleware both pass header targets through. Tests: path/header beat conflicting body targets; headers populate targets for header-scoped routes; body still used when no path/header target.
…targets-precedence fix(api): close scope-enforcer body-injection + header-target gaps (P1)
The 'is synchronous and fast (<1ms)' test asserted 1000 recordCheckpoint calls finish in <100ms wall-clock. That's an environmental perf budget, not a behavioral guarantee — it fails intermittently on a loaded CI host (observed 110ms) while passing in isolation. Rewritten to assert the actual intent: recordCheckpoint returns synchronously (no thenable) and completes its work in-call (journey present immediately after), with a generous 1000ms ceiling kept only as a regression backstop.
…er-timing test(core): de-flake JourneyTracker hot-path timing assertion
Addresses #589 review findings #5, #6, #7 (verified still-present on dev). Settings clobber (#5, #6 + a third matching site): Several writes set chats.settings to a freshly-built object, which replaces the entire JSONB column and drops unrelated keys (followUpConfig, close*, agentResumedAt, …). Fixed to merge over the existing settings — matching the established pattern in reopen-contact and PUT /follow-up/chats/:id: • POST handoff (agentPaused: true) • POST close-contact (agentPaused/closed/closeUntil/closeOutcome) • clear-session resume (agentPaused: false) — same bug, not in the review chat.closed escalated flag (#7): The event published escalated: terminal, so hard-terminal closes (won/lost, which are terminal:true but escalated:false) were always reported as escalations, corrupting BI/audit + any automation distinguishing hard closes from threshold escalations. Now publishes the computed escalated value (the HTTP response already returned the correct one). Service-layer merge was rejected on purpose: DELETE /follow-up/chats/:id relies on a wholesale replace to drop a key, so merging must stay at the call sites.
…d-escalated-flag fix(api): merge chat.settings on writes + publish real escalated flag
…n hint (#722) Partial fix for #722 (embedded->canonical upgrade break). Two operator-facing gaps fixed; the host-tooling-free data copy is a separate follow-up. #1 fail-fast (root cause: opaque crash loop): The dev API only fail-fasts on PGSERVE_EMBEDDED=true, which legacy embedded installs (pre-canonical-cutover) never set — so they fell into the API's 30x 'Database not ready' retry crash loop with no guidance. `omni start` now detects the legacy-embedded shape (no useCanonicalPgserve flag + an embedded data dir present) and exits with the `omni doctor --fix` hint before starting omni-api. Extracted legacyEmbeddedNeedsMigration() + tests. #3 misleading remediation hint: dumpEmbeddedDb/restoreSnapshotToCanonical told operators to `apt install postgresql-client`, which on Ubuntu 24.04 is PG16 and refuses to dump a PG18 server. New clientToolHint() names the REQUIRED major from the embedded cluster's PG_VERSION (e.g. postgresql-client-18 / brew postgresql@18) and warns the bare package is often too old. Validated end-to-end in an isolated sandbox: omni start now fails fast with exit 1 + the migration hint; doctor --fix surfaces the major-aware client hint. Refs #722.
…-migration fix(cli): fail-fast on legacy embedded pgserve + major-aware migration hint (#722)
…s.js (#722) The embedded->canonical migration copied data by shelling out to host psql/pg_dump — tools omni doesn't bundle and that, when present from a distro, are often an older major than the embedded server (PG18) and refuse to run. That's the root of the #722 upgrade break. Replace the psql transport with postgres.js COPY streams (a dependency the repo already ships in @omni/db): - pgConnect() opens a wire connection to the temp embedded reader + canonical - listTables/listColumns/copyTable/resetSequences now run over the wire - copyTable streams COPY ... TO STDOUT -> COPY ... FROM STDIN with pipeline() backpressure (kills the EPIPE-on-large-rows failure the old psql->file buffering worked around) - one reserved dst connection holds session_replication_role='replica' across TRUNCATE + every COPY + sequence resets (it's a session GUC) Net: migrateUnmountedEmbeddedToCanonical + compareEmbeddedVsCanonicalCounts need NO host client tools — only the matching-major server reader, already auto-fetched. Verified in a sandbox: postgres.js text-COPY round-trips jsonb/bytea/int[]/nulls byte-for-byte. Extracted resolveTablesToMigrate/copyAllTables/countRowDivergence to stay under the cognitive-complexity ceiling. Existing migration + doctor tests green. Follow-up (same issue): wire the primary fixPgserveCanonical path (schema-via-migrations + this copier) so `omni doctor --fix` is fully automatic; today it still uses the pg_dump/psql snapshot path. Refs #722.
Synchronous + instant, but the default 5s per-test timeout flakes under heavy host load (CPU starvation inflates wall-clock >5s), intermittently failing the full-suite pre-push gate. 30s headroom — same class of fix as the JourneyTracker de-flake.
…n-pgjs refactor(cli): host-tooling-free embedded→canonical migration via postgres.js (#722)
…ing-free) (#722) Rewire fixPgserveCanonical off the pg_dump/psql snapshot path onto the host-tooling-free postgres.js copier (#724). New flow: stop omni-api → pgserve install → persist canonical config → delete+start omni-api on canonical (startup runs drizzle migrations → schema) → wait for /health → copy data via migrateEmbeddedData (postgres.js COPY over the wire). The schema is created by omni-api's own migrations, so the copy is data-only. No pg_dump/psql, no client-major mismatch, no manual steps — a fresh `omni doctor --fix` now completes the embedded→canonical migration on its own. Failure semantics for the new architecture (this build can't run embedded): - setup fails (pre-flip): restart omni-api on embedded, FAIL. - after the config flip canonical is the target; a health-timeout or copy failure leaves canonical healthy/empty with the embedded data INTACT, and the idempotent embedded-data-orphaned check re-runs the copy next time. Added DoctorDeps.migrateEmbeddedData (stubbable) + waitForApiHealthy. Rewrote the pgserve-canonical fix tests for the new ordering + failure modes (dump/restore are no longer invoked; copy-skip, health-timeout, copy-throw covered). dumpEmbeddedDb/restoreSnapshotToCanonical remain for now (still referenced by update-maintenance tests) but are no longer used by the fix. Orchestration validated by 49 doctor unit tests; the postgres.js COPY engine was proven byte-for-byte in #724. Closes #722.
The clean-room e2e surfaced two more psql shell-outs in the canonical setup
path that the doctor-fix rewire depends on — both assumed `psql` is on PATH,
which is false on a fresh host (autopg/pgserve ship only initdb/pg_ctl/postgres,
no client tools). Bun.spawn throws on a missing executable, aborting the
migration with 'Executable not found in $PATH: psql'.
Converted both to postgres.js (already a CLI dep):
- canonical-pgserve.ts ensureOmniDatabaseExists: CREATE DATABASE via wire,
42P04 → already-exists → success (idempotent).
- role-cutover.ts runProvisioningSql (was runPsql): runs the role-create +
grants scripts as a simple query over the local unix socket as postgres
superuser; fails-fast on first error (ON_ERROR_STOP equivalent).
With these, `omni doctor --fix` needs ZERO host postgres client tools.
Validated end-to-end in a clean container: @latest embedded install + seeded
canary → upgrade to patched @next → `omni doctor --fix` → 13 OK / 0 FAIL,
omni-api healthy on canonical, 30 tables copied, canary instance preserved
(identical UUID).
Refs #722.
fix(cli): fully automatic, host-tooling-free omni doctor --fix migration (#722)
…#722) Clean-room re-validation of the update-from-dev path surfaced a first-pass race: when `pgserve install` provisions canonical DURING `omni doctor --fix`, the freshly-started postmaster may not accept connections for a second or two. ensureOmniDatabaseExists + the role-cutover connect immediately after, the connect is refused, both no-op (best-effort), and omni-api is left crash-looping on a missing `omni` db — the operator had to run `omni doctor --fix` a second time for it to succeed. Add waitForCanonicalReady(port): poll the postmaster (postgres.js SELECT 1, up to ~30s) after `pgserve install` and before db/role provisioning. Validated in a fresh container: @latest embedded + seeded canary → update to this build → a SINGLE `omni doctor --fix` migrates 30 tables to canonical, omni-api healthy on canonical, canary preserved. (Previously needed two passes.) Refs #722.
…s-wait fix(cli): wait for canonical postmaster readiness before provisioning (#722)
Rolling Promotion PR
Auto-maintained rolling promotion PR from
devtomain.Process:
ready-to-mergeadded when all checks passNotable changes in this promotion
This rolling batch promotes the full
devbacklog (~550 commits). Highlights from recent work:🔒 Security — scope-enforcer P1 fixes (#719)
Closed three P1 authorization gaps surfaced by the review bots:
extractLockTargetstrusted the request body over path params, letting a caller authorize against one instance/chat while operating on another. Path/header targets now win over the (caller-controlled) body.x-omni-instance/x-omni-chat, e.g.POST /turns/close) — those targets are now extracted and enforced.🐛 Data correctness (#721)
chats.settingsclobber — handoff, close-contact, and clear-session paths replaced the wholesettingsJSONB, droppingfollowUpConfig/close*keys. Now merge over existing settings.chat.closedevent publishedescalated: terminal— hard closes (won/lost) were always reported as escalations. Now publishes the computedescalated.🔄 Embedded → canonical pgserve upgrade — fully automatic & host-tooling-free (#722 — PRs #723, #724, #725, #727)
Upgrading an embedded-Postgres install to canonical pgserve used to crash-loop and dead-end (
omni doctor --fixneededpg_dump/psql, which aren't bundled and were often the wrong major). Now:omni startfails fast on legacy embedded with an actionable hint (no opaque 30× crash loop); thedoctor --fixclient-tool hint names the required PG major.COPYstreams (over the wire, backpressured) — zero host client tools; byte-for-byte fidelity for jsonb/bytea/arrays/nulls.doctor --fixpath wired to the tool-free copier (schema via omni-api's own migrations); the last twopsqlshell-outs in canonical db/role provisioning converted to postgres.js.doctor --fix).Validated end-to-end in clean containers:
@latestembedded install + seeded data → update → singleomni doctor --fix→ migrated to canonical, omni-api healthy, all entity types preserved with identical IDs (instances, agent providers, agents, routes, persons/identities, chats, participants, messages, follow-up config, automations, webhooks). Only message media blobs (auto-rehydrated from the channel) and API keys (canonical keeps its own) are intentionally not copied.✅ Test reliability (#720 + #727 follow-on)
De-flaked load-sensitive timing tests (
JourneyTracker,writeDiagnostics) that intermittently failed the full-suite gate under host CPU starvation.